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Abstract 

Background: MicroRNAs (miRNAs) are a class of small non-coding RNAs that regulate gene expression by targeting 
mRNAs for translation repression or mRNA degradation. Although many miRNAs have been discovered and studied 
in human and mouse, few studies focused on porcine miRNAs, especially in genome wide. 

Results: Here, we adopted computational approaches including support vector machine (SVM) and homology 
searching to make a global scanning on the pre-miRNAs of pigs. In our study, we built the SVM-based porcine 
pre-miRNAs classifier with a sensitivity of 100%, a specificity of 91.2% and a total prediction accuracy of 95.6%, 
respectively. Moreover, 2204 novel porcine pre-miRNA candidates were found by using SVM-based pre-miRNAs 
classifier. Besides, 116 porcine pre-miRNA candidates were detected by homology searching. 

Conclusions: We identified the porcine pre-miRNA in genome-wide through computational approaches by 
utilizing the data sets of pigs and set up the porcine pre-miRNAs library which may provide us a global scanning 
on the pre-miRNAs of pigs in genome level and would benefit subsequent experimental research on porcine 
miRNA functional and expression analysis. 
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Background 

MicroRNAs (miRNAs) are a family of ~22nt endogenous 
non-coding RNAs [1,2]. Mature miRNAs are usually 
cleaved from ~90nt miRNA precursors (pre-miRNAs) 
which are derived from processing of a long primary 
miRNA (pri-miRNA) by a ribonucluease [3]. Increasing 
evidences have shown that miRNAs play fundamentally 
important roles in various biological processes, including 
cell proliferation [4-7], development timing [8,9], apop- 
tosis [10,11], carcinogenesis [12-14], and response to dif- 
ferent environmental stresses containing disease [15-17]. 

Since the first lin-4 miRNA of C. elegans was discov- 
ered in 1992 [18], more than 19000 miRNAs have been 
found in animals and plants. Currently, the miRNA 
Registry Database (Release 17, April 2011; http:// 
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mirbase.org), a comprehensive and searchable database 
of published miRNA sequences, contains 16772 entries 
representing hairpin pre-miRNAs, expressing 19724 ma- 
ture miRNA products, in 153 species [19]. However, 
only 228 pre-miRNAs of pigs are included in this data- 
base, the number is far less than it really has. 

Pre-miRNAs have similar hairpin-shaped stem loop 
structure, high minimal folding free energy index, and 
high evolutionary conservation. They become the im- 
portant features which could be used in the computa- 
tional identification of pre-miRNA [20-22]. To date, 
computational prediction has been broadly used to iden- 
tify potential pre-miRNAs in animals and plants [23-25], 
because it is not limited by tissue specificity and time of 
miRNA expression. Especially, machine learning ap- 
proaches such as random forest (RF) [26], naive Bayes 
classifier [27], hidden Markov model [28,29] and SVM 
[30-32] have been adopted. 

Although previous studies have identified a certain 
number of porcine pre-miRNAs, few researches in com- 
putational identification of pre-miRNAs based on the 
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whole genome sequences are being done. Furthermore, 
most of the machine learning approaches are based on the 
data sets of human, while the features of the pre-miRNAs 
also exhibit the species-specificity. Therefore, we are aimed 
to identify the porcine pre-miRNA in genome-wide 
through computational approaches by utilizing the data sets 
of pigs in our study, which may provide us a global scan- 
ning on the pre-miRNAs of pigs in genome level. In our 
study, we built the SVM-based porcine pre-miRNAs classi- 
fier with a sensitivity of 100%, a specificity of 91.2% and a 
total prediction accuracy of 95.6%, respectively. As a result, 
2204 and 116 porcine pre-miRNA candidates were separ- 
ately detected by using SVM-based pre-miRNAs classifier 
and homology searching. 

Results and discussion 

Performance of the SVM-based pre-miRNAs classifier 

SVM-based porcine pre-miRNAs classifier was built by 
using the data sets of pigs. Interestingly, all of porcine 
pre-miRNAs of the test set were correctly detected by 
our classifier, which achieved a sensitivity (SE) of 100%, 
a specificity (SP) of 91.2% and a total prediction accur- 
acy (ACC) of 95.6%, respectively. The power of the pre- 
miRNAs classifier was given in Table 1. Moreover, the 
performance of the classifier was also tested by a ROC 
curve. As shown in the Figure 1, the classifier achieved a 
five-fold cross-validation rate of 99.54%. In a word, it 
indicated that our classifier was available for the predic- 
tion of porcine pre-miRNAs. Additionally, it also 
demonstrated that the comprehensive use of the pre- 
miRNAs features of the secondary structure and 
sequence information was an effective strategy in pre- 
miRNAs prediction. 

Xue et al. obtained an accuracy of 90% by using a set 
of features combining the local contiguous structures 
with sequence information to distinct the pre-miRNAs 
with that of pseudo pre-miRNAs [30], and those features 
have been used by several other pre-miRNA predicting 
methods [26,31,33]. Their studies demonstrated that 
those features were effective in pre-miRNA prediction. 
Thus, we also adopted those features in our study. Later, 
Jiang et al. found that the predicting performance signifi- 
cantly increased by combining the minimum of free en- 
ergy (MFE) of the secondary structure or p-value feature 

Table 1 Performance of the pre-miRNAs classifier on test 
sets. 
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Figure 1 ROC curve for the pre-miRNAs classifier on the test 
set. The curve with more areas has better performance of the 
classifier. It showed the classifier reached a well performance. 



Test set represents positive and negative set used to test the power of the 
pre-miRNAs classifier. Type represents the classification of the test set. Size is 
the number of the real or pseudo pre-miRNAs contained in test set. Accuracy 
is the percentage of the real or pseudo correctly recognized by pre-miRNAs 
classifier. 



with the local contiguous triplet structure composition 
feature. Their results indicated that a comprehensive fea- 
ture vector was able to extract more information of a 
primary sequence and reach a better prediction perform- 
ance [26]. Our classifier was capable of achieving a well 
prediction performance with an accuracy of 95.6% may 
be due to the using of a combined feature vector, be- 
cause additional seven features used in our study have 
been proved to be one part of the optimized features 
subset in pre-miRNAs prediction by Wang et al. [3]. 

Identification of pre-miRNAs candidates on pig genome 
using the SVM-based classifier 

Since the genome sequences contain the full information 
of a species and the database of non-coding RNA of pigs 
is quite incompletely, thus we used whole genome se- 
quences to construct the prediction set (PR-S). After 
splitting the pig genome, we obtained more than 222 
million short sequences. The PR-S constructed by short 
sequences passed by pre-filter was further distinguished 
by our SVM-based pre-miRNAs classifier. As pre-filter 
parameters would be very useful in filtering the pseudo 
pre-miRNAs from huge number of similar pre-miRNA 
sequences, those pre-filters were incorporated into the 
SVM-based classifier to predict novel pre-miRNAs. Ex- 
cept for the redundancy and the known pre-miRNAs, 
we finally got 2204 pre-miRNA candidates with the 
probability more than 0.99995 in the pig genome. They 
were formed into 1849 clusters according to their loca- 
tions in genome wide (inter-distance <=50kb [34]). 
Those pre-miRNA candidates were blasted with porcine 
CDS and other non-coding RNA (NONCODE v3.0, 
http://www.noncode.org/NONCODERv3/). The result 
shown that 6 novel pre-miRNAs (coverage >90%, iden- 
tities =100% with CDS) overlap with coding region. 
Namely, 2198 out of 2204 new pre-miRNAs are in the 
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non-coding region. And none of pre-miRNAs (coverage 
>90%, identities >90% with non-coding RNAs) were 
found that overlap with other non-coding RNAs. The 
procedure for predicting porcine pre-miRNAs was given 
as Figure 2. 

The large number of the novel pre-miRNA candidates 
indicated that there were still many unidentified pre- 
miRNAs in pigs. Previous studies estimated that the 
number of miRNAs have taken up to approximately 2- 
3% of the total number of genes in animal genomes [20]. 
According to our study, the number of the pre-miRNAs 
would be more than previous estimate. Expression pro- 
filing studies showed that most miRNAs were under the 
control of tissue-specific and development signaling, or 
both [35-37]. As a result, it may lead to a less number of 
miRNA identified by experimental methods and a low 
evaluate of pre-miRNAs' number. Indeed, in our studies, 
we regarded those pre-miRNA candidates as the real 
porcine pre-miRNAs in the view of bioinformatics. 
Meanwhile, those pre-miRNA candidates were set up to 
the porcine pre-miRNA library, the detail information of 
which was given in Additional file 1. 

To explore the location distribution of all the pre- 
miRNA candidates, we calculated the number of pre- 
miRNA candidates in each chromosome. And the 
chromosome 1 covered the maximum number of pre- 
miRNAs candidates, while the chromosome 18 included 
the minimum. To a large extent, the number was con- 
sistent with the length of chromosome, namely the big- 
ger of the chromosome the more number of pre-miRNA 
candidates it contained. The density analysis of pre- 
miRNA in chromosome showed that chromosome X, 8 
and 16 maintained the highest density of pre-miRNA. 
The chromosome 8 was also found that it had a high 
density of quantitative trait locus (QTL) (http://www. 
animalgenome.org/cgi-bin/QTLdb/SS/index). Thus, the 
result suggested other researchers should pay more at- 
tention to study the chromosome 8 of pigs in the future. 
The result of density analysis of pre-miRNA and QTL in 
chromosome was given in Additional file 2. 



At the same time, 215 unique pre-miRNAs were iden- 
tified in pigs by Solexa sequencing in another published 
study [38]. Based on the comparison this data with ours, 
we found that 49 (coverage >90%, identities >90% with 
predicting pre-miRNAs ) of above 215 unique pre- 
miRNAs were included in our study. In Chen et al.s 
study, it mainly focused on identifying miRNAs in por- 
cine backfat tissues. Tissues-specificity may lead to a 
bias on much more number of miRNAs identified in 
backfat tissues in their study, meanwhile some of their 
candidate miRNAs were unidentified by our method due 
to a limited length of 90-nt changed their features in our 
study. These may count for the low overlap rate. How- 
ever, the result of Chen et al. s study may still provide a 
piece of experimental evidence for our study. After the 
step of pre-filtering, a total of 160 known pre-miRNAs 
were retained in PR-S. 181 sequence fragments (cover- 
age >90%, identities =100% with known pre-miRNAs) 
represented 115 known pre-miRNAs were detected by 
classifier. Namely, the sequence fragments of the 
known pre-miRNAs in the PR-S could be detected 
with the coverage of 72% (115 out of 160). The details 
those known pre-miRNAs sequence fragments were 
given in Additional file 3. There are several possible 
reasons accounting for that not all the reported por- 
cine pre-miRNAs in miRNA Registry Database were 
covered in our studies. Firstly, not all the pre-miRNA 
sequences are expressed in the order of the genome 
sequence due to the RNA editing [39,40] , such as 
mir-381,mir-1271. According to our observation, 184 
out of known 224 pre-miRNAs are completely identi- 
cal to the sequence of the genome, thus 40 known 
pre-miRNA sequences unmapped to the genomic se- 
quence data were filtered. Secondly, in order to reduce 
the pseudo pre-miRNAs as more as possible, the pre- 
filter parameters setting is up to some reported pre- 
miRNAs, such as the value of the minimal folding free 
energy index (MFEI). 160 out of 184 known pre- 
miRNAs were retained (20 known pre-miRNA were 
missed) after this step. Thirdly, the length of the short 
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Figure 2 Flowchart of the porcine pre-miRNA prediction procedure. 
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sequence is limited to 90-nt, while some features of 
pre-miRNAs (such as adjusted minimal folding free 
energy (N(AMFE)) and the adjust number of paired 
nucleotides (N(ANNB)) have connection with the se- 
quence length [32,41], which may influence the fea- 
tures of 45 reported pre-miRNAs and lead them to be 
undetected. 

Although the classifier produced a specificity of 91.2%, 
the candidate hairpins could be lead to a certain number 
of false positives in genome- wide prediction. Thus, the 
next problem removing those pseudo pre-miRNAs in 
the library is needed to be considered deeply. 

Identification of the pre-miRNAs candidates using the 
homologous searching 

Since the pre-miRNA candidate sequences were split 
from genome with a specified length of 90-nt which may 
lead some of them undetected by our SVM classifier and 
the coverage of some model species with our SVM-based 
classifier result (coverage >85%, identities >85% with 
model species known pre-miRNAs) were 8% (human), 
12% (mouse), 22% (rat), 16% (cow) and 31% (dog), which 
was not so high. The SVM-based classifier s training set 
was composed by the porcine known pre-miRNAs to 
predict the novel pre-miRNAs of pigs. The feature of 
pre-miRNAs exhibits the species-specificity. It may cause 
our SVM-based classifier have some biases to detect 
more pre-miRNA possessed only by pigs. The species- 
specificity and homologous porcine pre-miRNAs uniden- 
tified in model species may contribute to the low overlap 
rate. It was necessary to make it up by some other com- 
putational methods. At present, besides the SVM classi- 
fier the homologous searching is also a widely used 
method for identifying the pre-miRNAs, because the pre- 
miRNAs have a highly conservation among the different 
species [20]. What's more, in recent years, a large num- 
ber of new pre-miRNAs were identified in some model 
species, such as Mouse, Human. Up to now, according to 
the records of miRNA Registry Database (Release 17, 
April 2011; http://mirbase.org ), it contains human 
(1424), mouse (720), rat (408), cow (662) and dog (323). 
While, there are only 228 pre-miRNAs in pig. Therefore, 
it is quite necessary for us to do a homologous searching 
once again to find the new pre-miRNAs of porcine by 
using the identified pre-miRNAs in the other species. 

According to the criteria mentioned in homologous 
searching method, we found 116 new pre-miRNAs can- 
didates, and the detail information of which was given in 
Additional file 4. Interestingly, some pre-miRNAs candi- 
dates were mapped to more than one location of chro- 
mosomes. Guo et al. thought that cross-mapping events 
in pre-miRNAs revealed potential miRNA-mimics and 
evolutionary implications [42]. The newly identified por- 
cine pre-miRNAs candidates belong to different miRNA 



families, such as miR-1282, miR-3059, miR-3120, miR- 
3618. Among them, miR-3120 initially identified from 
melanoma [43] and miR-3618 from human cervical can- 
cer and normal cervices [44] have a highly conservation 
with pigs. We have also compared this result with the 
SVM-based and found no overlap between them. Actu- 
ally, there were some of them passing SVM model 
before filtering in our study. However, when the predic- 
tion probability was set as more than 0.99995 to reduce 
false positive, they were filtered out with a result of no 
overlap between homology search and SVM-model 
candidates. There is no doubt that the high conservation 
of pre-miRNAs among the species also provides us a 
rapid way to identify the pig pre-miRNAs. This would 
be helpful to further enrich the resource of pre-miRNAs 
databases. 

Conclusions 

In conclusion, we built the SVM-based pre-miRNAs clas- 
sifier using the known pre-miRNAs and CDS sets of the 
pigs. From the porcine genome, we discovered 2204 new 
pre-miRNAs candidates by our SVM-based classifier and 
116 pre-miRNAs candidates by homology searching. Our 
study would provide guidance on further experimentally 
verifying swine pre-miRNA in the future and offer the op- 
portunity to research gene function and the genetic mech- 
anism of complex traits in genome level. 

Methods 

Sequence data collection 

The porcine genomic sequences were available from UCSC 
database (Mar 2010, http://hgdownload.cse.ucsc.edu/gold- 
enPath/susScr2/bigZips/). The precursor sequences of 
known miRNAs of Homo sapiens (human), Mus musculus 
(mouse), Rattus norvegicus (rat), Bos Taurus (cow), Canis 
familiaris (dog) and Sus scrofa (pig) were obtained from 
miRNA Registry Database (Release 17, April 2011; http:// 
mirbase.org) [19]. The porcine protein coding regions 
sequences (CDS) were downloaded from NCBI (ftp://ftp. 
ncbi.nih.gov/genomes/Sus_scrofa/RNA/), which were used 
as the pseudo pre-miRNA data. 

The length of the pre-miRNAs sequences (LS) 

The statistical length distribution of porcine pre-miRNA 
from miRNA Registry Database is that 86% of them 
within 75-105 nt. In our study, both the porcine gen- 
ome sequences and CDS were divided into short 
sequences using a 90-nt sliding window with 9-nt incre- 
ments at one time [3,33]. 

The complexity of the sequences 

Low- complexity of the sequences, such as those with 
single nucleotide repeated > 8 times (for example, 
A A AAA AAA), dinucleotides repeated > 7 times (for 
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example, AG AG AG AG AG AG AG), trinucleotides repea- 
ted > 4 times (for example, ATGATGATGATG), were 
removed for further analysis, since we observed few known 
pre-miRNA possessed such sequences. Additionally, the 
sequences with the region of gap were removed. 

MFE feature 

MFE of the secondary structure was predicted by the 
Vienna RNA software package (RNAfold) (Version 1.8.5; 
http://wwwtbi.univie. ac.at/~ivo/ RNA/) [45,46]. Previ- 
ous studies indicated that pre-miRNAs have a high nega- 
tive MFE and MFEI, which is a useful criterion to 
distinguish pre-miRNAs from all coding or non-coding 
RNAs [41]. The MFEI was calculated by the equation: 
MFEI = (- 100 x MFE/LS)/(G + C). 

The three characteristics related to MFE were used as 
the feature vectors in SVM, and they were defined as 



follows: 

Af(MFE) = (-MFE)/1000 (1) 

Af(MFE) = MFEI/10 (2) 

Af(AMFE) = (-MFE)/(10 x LS) (3) 



Base-pairings and the secondary structure features 

Because nucleic acid G can be paired with C or U, the 
base-pairings on the stem of the hairpin structure 
included the GU wobble pairs. And the threshold of the 
minimum base-parings of real pre-miRNA was 18. In- 
deed, the stem of the hairpin structure is highly con- 
served in pre-miRNAs, so we still only considered the 
stem regions of the pre-miRNA. The number of paired 
nucleotides (NNB), the adjust number of paired nucleo- 
tides (ANNB) and the number of nucleotides of the stem 
parts (NNS) were utilized as three feature vectors, 
defined as follows: 

Af(NNB) = NNB/ 1000 (4) 

Af(ANNB) = NNB/LS (5) 

Af(NNS) = NNB /NNS (6) 

Meanwhile, we denoted the contents of GC as follows: 

N(GC) = GC/1000 (7) 

Besides, seven other features, including the structural 
diversity (N(Diversity)) (8), the frequency of the MFE 
structure (N(Freq/100)) (9) [46], adjusted base pair dis- 
tance (N(dD)) (10) [47], average distance between in- 
ternal loops (N(D_interlp/1000)) (11), the ratio of |A-U| 
to the length of sequence (N(|A-U|/LS)) (12), the length 
of the longest relaxed symmetry region (N(l_rsym_rgn/ 
100)) (13) and the length of the longest symmetry region 
(N(l_sym_rgn/100)) (14), which were found as the 



optimized features for pre-miRNAs prediction according 
to the studies of Wang et.al [3], were also adopted. 

The local adjacent sequence-structure features 

Previous studies have shown that local sequence features 
play a crucial role in pre-miRNAs [48]. Additionally, 
Xue et al. found that the distributions of local conti- 
guous sub-structures of pre-miRNAs are significantly 
distinguished with that of pseudo pre-miRNAs [30]. 
Therefore, in our study, we also characterized the sec- 
ondary structure of pre-miRNAs by combining of the se- 
quence information with the local contiguous structures. 

There are only two conditions for each nucleotide in 
the predicted secondary structure by RNAfold [45], 
paired or unpaired, denoted by brackets "(" and dots 
respectively. The left bracket "(" represents that paired 
nucleotide located near 5'-end which can be paired with 
another nucleotide at the 3'-end indicated by a right 
bracket ")"• We used "(" for both situations without dif- 
ferentiating "(" or ")", because no evidence has indicated 
that mature miRNAs have a preference of the 3' or 5' 
arms of their hairpin precursors. Obviously, for any 3 
adjacent nucleotides, there are eight possible structure 
units: "(((", "((.", %(", ".((", V, ".(• "> "~C> • Further- 
more, by considering the left nucleotide among the three 
[31], there are 32 possible sequence-structure units, left- 
triplet coding ,denoted as "A(((", "U(((", "A((.", etc. as 
shown in Additional file 5 [31]. Here, we only consid- 
ered the stem regions of a pre-miRNA by excluding the 
external single-stranded parts and the terminal loop. 
Similar features have been adopt by pioneer work, e.g. 
that Zhao et al. [31]. The frequency of each left- triplet 
coding of pre-miRNA was counted to create the 32 
feature vectors. After normalizing, the frequency was 
used as input features for SVM. Combining with 14 fea- 
tures above, in all 46 feature vectors (summarized in 
Additional file 6) were taken as the input of SVM. 

The pre-filter parameters of secondary structure features 

Each sequence secondary structure, predicted by the 
Vienna RNAfold, was passed through a set of filter pa- 
rameters. The filtering parameters [33,49,50] related to 
some terms of secondary structures were given as 
Additional file 7 [3], which were shown below. 

(a) The number of hairpin loops = 1; 

(b) The number of symmetrical loops < 6. 

(c) The number of asymmetrical loops < 4. 

(d) The number of bulges < 5. 

(e) The total number of symmetrical and asymmetrical 
loops < 8. 

(f ) The total number of symmetrical, asymmetrical 
loops and bulges <10. 

(g) The number of the base pairing >17. 
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(h) The value of ANNB is between 0.3-0.43. 

(i) The length of symmetrical loops < 5. 
(j) The length of asymmetrical loops <6. 
(k) The length of bulges < 6. 

(1) The MFE < -15kal/mol. 
(m)The MFEI >0.7. 

(n) The percentage of the GC contents is between 
30-70%. 

SVM data set 

Among the 228 known porcine pre-miRNAs, whose sec- 
ondary structures with no multiple loops were consid- 
ered. 224 pre-miRNAs, covering more than 98% of all 
the reported porcine pre-miRNAs, were retained. We 
randomly extracted 184 pre-miRNAs from them as one 
part of training set (TR-S) and the remaining 40 pre- 
miRNAs formed into the test set 1 (TE-S1). 

A pseudo pre-miRNAs set was collected from the por- 
cine CDS and 5677 pseudo pre-miRNAs were selected 
due to their similar stem-loop structures to real pre-miR- 
NAs. The criteria for extracting the pseudo pre-miNRAs 
from CDS segment was complied with the pre-filter 
parameters of the secondary structure features above. 
184 pseudo pre-miRNAs selected randomly from the 
pseudo pre-miRNAs set composed another part of TR-S. 
Furthermore, we randomly took out 1000 pseudo pre- 
miRNA from the remaining pseudo pre-miRNAs set as 
test set 2 (TE-S2). 

In addition, the porcine genome sequence fragments 
split from genome using a 90-nt sliding window with 9-nt 



increments at one time, passed the pre-filter parameters 
of secondary structure features (including (a),(g),(l) and 
(MFEI>0.6)), were collected for further identifying by 
SVM classifier and constructed the PR-S. The compo- 
sition of each set was shown in Figure 3. 

SVM 

SVM, based on statistical theory [51], has a good 
generalization ability [52]. Therefore, in our study, SVM 
was adopted as a classifier to identify the real and 
pseudo pre-miRNAs. It was trained by the TR-S with the 
performance estimated by TE-S and applied to the PR-S. 
A 46-dimension feature vector referred to the above was 
taken as the input of SVM and the output was the num- 
ber value "1", which means the true, or "-1" indicating 
the false. 

In our study, we downloaded a widely used software 
package Libsvm (Version 3.1, April 2011; http://www. 
csie.ntu.edu.tw/~cjlin/libsvm/) [53] to carry out our 
work. In order to acquire SVM classifier with optimal 
performance, we applied five cross-validation in model 
training, which could obtain the optimal penalty para- 
meter C and the RBF kernel parameter g. Meanwhile, 
the performance of the SVM classifier was evaluated by 
following the assessment system used in RF [26]. 

Homologous searching 

We chose pre-miRNAs of five other mammalian species 
(including human, mouse, rat, cow and dog), which have 
a highly homology with pigs. Firstly, we removed the 



Negative set 



Predictive set 




Figure 3 The composition of each set including the training set (TR-S), testing set (TE-S1 and TE-S2) and predictive set (PR-S). 184 real 
and pseudo porcine pre-miRNAs are randomly extracted from positive set (224 known real porcine pre-miRNAs) and negative set (5677 porcine 
CDS), respectively, and then they form into the training set. The remaining 40 real porcine pre-miRNAs compose the test set 1 (TE-S1). 1000 
pseudo pre-miRNAs from the remaining negative set are randomly selected as test set 2 (TE-S2). Both TE-S1 and TE-S2 are used to test the 
performance of the SVM-based pre-miRNAs classifier. The predicting set (PR-S) is constructed by the porcine genome sequence fragments passed 
the pre-filter parameters of secondary structure features. 
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pre-miRNAs which have a highly homologous with 228 
known porcine pre-miRNAs from the total pre-miRNAs 
of five species by utilizing the software of BLAST (ncbi- 
blast>2.2.25+; ftp://ftp.ncbi.nlm.nih.gov/blast/executables/ 
blast+ /LATEST/) [54]. Next, the remaining pre-miRNAs 
were blasted with the genome sequence of pigs and the se- 
quence fragments (coverage >85%, identities >85% with 
pre-miRNAs) were retrieved from genome. Lastly, after 
discarding the redundant sequences, the sequences were 
regarded as pre-miRNA candidates if they accorded with 
the following criteria [55,56] :(i) an RNA sequence can fold 
into an stem-loop hairpin structure; (ii) predicted second- 
ary structures had MFE less than -15kcal/mol;(iii) mini- 
mum base pairings on the stem of the hairpin structure 
isl8;(iv) no multiple loops; (v) the GC contents is between 
30-70%. 
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